{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/RAGChecker.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAGChecker: A Fine-grained Evaluation Framework For Diagnosing RAG\n",
    "\n",
    "RAGChecker is a comprehensive evaluation framework designed for Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics to assess both the retrieval and generation components of RAG systems, offering detailed insights into their performance.\n",
    "\n",
    "Key features of RAGChecker include:\n",
    "- Fine-grained analysis using claim-level entailment checking\n",
    "- Comprehensive metrics for overall performance, retriever efficiency, and generator accuracy\n",
    "- Actionable insights for improving RAG systems\n",
    "\n",
    "For more information, visit the [RAGChecker GitHub repository](https://github.com/amazon-science/RAGChecker).\n",
    "\n",
    "## RAGChecker Metrics\n",
    "\n",
    "RAGChecker provides a comprehensive set of metrics to evaluate different aspects of RAG systems:\n",
    "\n",
    "1. Overall Metrics:\n",
    "   - Precision: The proportion of correct claims in the model's response.\n",
    "   - Recall: The proportion of ground truth claims covered by the model's response.\n",
    "   - F1 Score: The harmonic mean of precision and recall.\n",
    "\n",
    "2. Retriever Metrics:\n",
    "   - Claim Recall: The proportion of ground truth claims covered by the retrieved chunks.\n",
    "   - Context Precision: The proportion of retrieved chunks that are relevant.\n",
    "\n",
    "3. Generator Metrics:\n",
    "   - Context Utilization: How well the generator uses relevant information from retrieved chunks.\n",
    "   - Noise Sensitivity: The generator's tendency to include incorrect information from retrieved chunks.\n",
    "   - Hallucination: The proportion of incorrect claims not found in any retrieved chunks.\n",
    "   - Self-knowledge: The proportion of correct claims not found in any retrieved chunks.\n",
    "   - Faithfulness: How closely the generator's response aligns with the retrieved chunks.\n",
    "\n",
    "These metrics provide a nuanced evaluation of both the retrieval and generation components, allowing for targeted improvements in RAG systems."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Requirements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU ragchecker llama-index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup and Imports\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from ragchecker.integrations.llama_index import response_to_rag_results\n",
    "from ragchecker import RAGResults, RAGChecker\n",
    "from ragchecker.metrics import all_metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a LlamaIndex Query Engine\n",
    "\n",
    "Now, let's create a simple LlamaIndex query engine using a sample dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load documents\n",
    "documents = SimpleDirectoryReader(\"path/to/your/documents\").load_data()\n",
    "\n",
    "# Create index\n",
    "index = VectorStoreIndex.from_documents(documents)\n",
    "\n",
    "# Create query engine\n",
    "rag_application = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using RAGChecker with LlamaIndex\n",
    "\n",
    "Now we'll demonstrate how to use the `response_to_rag_results` function to convert LlamaIndex output to the RAGChecker format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# User query and groud truth answer\n",
    "user_query = \"What is RAGChecker?\"\n",
    "gt_answer = \"RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance.\"\n",
    "\n",
    "\n",
    "# Get response from LlamaIndex\n",
    "response_object = rag_application.query(user_query)\n",
    "\n",
    "# Convert to RAGChecker format\n",
    "rag_result = response_to_rag_results(\n",
    "    query=user_query,\n",
    "    gt_answer=gt_answer,\n",
    "    response_object=response_object,\n",
    ")\n",
    "\n",
    "# Create RAGResults object\n",
    "rag_results = RAGResults.from_dict({\"results\": [rag_result]})\n",
    "print(rag_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating with RAGChecker\n",
    "\n",
    "Now that we have our results in the correct format, let's evaluate them using RAGChecker:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize RAGChecker\n",
    "evaluator = RAGChecker(\n",
    "    extractor_name=\"bedrock/meta.llama3-70b-instruct-v1:0\",\n",
    "    checker_name=\"bedrock/meta.llama3-70b-instruct-v1:0\",\n",
    "    batch_size_extractor=32,\n",
    "    batch_size_checker=32,\n",
    ")\n",
    "\n",
    "# Evaluate using RAGChecker\n",
    "evaluator.evaluate(rag_results, all_metrics)\n",
    "\n",
    "# Print detailed results\n",
    "print(rag_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output will look something like this:\n",
    "\n",
    "```python\n",
    "RAGResults(\n",
    "  1 RAG results,\n",
    "  Metrics:\n",
    "  {\n",
    "    \"overall_metrics\": {\n",
    "      \"precision\": 66.7,\n",
    "      \"recall\": 27.3,\n",
    "      \"f1\": 38.7\n",
    "    },\n",
    "    \"retriever_metrics\": {\n",
    "      \"claim_recall\": 54.5,\n",
    "      \"context_precision\": 100.0\n",
    "    },\n",
    "    \"generator_metrics\": {\n",
    "      \"context_utilization\": 16.7,\n",
    "      \"noise_sensitivity_in_relevant\": 0.0,\n",
    "      \"noise_sensitivity_in_irrelevant\": 0.0,\n",
    "      \"hallucination\": 33.3,\n",
    "      \"self_knowledge\": 0.0,\n",
    "      \"faithfulness\": 66.7\n",
    "    }\n",
    "  }\n",
    ")\n",
    "```\n",
    "\n",
    "This output provides a comprehensive view of the RAG system's performance, including overall metrics, retriever metrics, and generator metrics as described in the earlier section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting Specific Metric Groups\n",
    "\n",
    "Instead of evaluating all the metrics with `all_metrics`, you can choose specific metric groups as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragchecker.metrics import (\n",
    "    overall_metrics,\n",
    "    retriever_metrics,\n",
    "    generator_metrics,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting Individual Metrics\n",
    "\n",
    "For even more granular control, you can choose specific individual metrics for your needs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragchecker.metrics import (\n",
    "    precision,\n",
    "    recall,\n",
    "    f1,\n",
    "    claim_recall,\n",
    "    context_precision,\n",
    "    context_utilization,\n",
    "    noise_sensitivity_in_relevant,\n",
    "    noise_sensitivity_in_irrelevant,\n",
    "    hallucination,\n",
    "    self_knowledge,\n",
    "    faithfulness,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "This notebook has demonstrated how to integrate RAGChecker with LlamaIndex to evaluate the performance of RAG systems. We've covered:\n",
    "\n",
    "1. Setting up RAGChecker with LlamaIndex\n",
    "2. Converting LlamaIndex outputs to RAGChecker format\n",
    "3. Evaluating RAG results using various metrics\n",
    "4. Customizing evaluations with specific metric groups or individual metrics\n",
    "\n",
    "By leveraging RAGChecker's comprehensive metrics, you can gain valuable insights into your RAG system's performance, identify areas for improvement, and optimize both retrieval and generation components. This integration provides a powerful tool for developing and refining more effective RAG applications."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyMCOXlVTBhB2s4DNBHE8B6C",
   "include_colab_link": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
